Eliciting New Wikipedia Users' Interests via Automatically Mined Questionnaires: For a Warm Welcome, Not a Cold Start
Every day, thousands of users sign up as new Wikipedia contributors. Once
joined, these users have to decide which articles to contribute to, which users
to seek out and learn from or collaborate with, etc. Any such task is a hard
and potentially frustrating one given the sheer size of Wikipedia. Supporting
newcomers in their first steps by recommending articles they would enjoy
editing or editors they would enjoy collaborating with is thus a promising
route toward converting them into long-term contributors. Standard recommender
systems, however, rely on users' histories of previous interactions with the
platform. As such, these systems cannot make high-quality recommendations to
newcomers without any previous interactions -- the so-called cold-start
problem. The present paper addresses the cold-start problem on Wikipedia by
developing a method for automatically building short questionnaires that, when
completed by a newly registered Wikipedia user, can be used for a variety of
purposes, including article recommendations that can help new editors get
started. Our questionnaires are constructed based on the text of Wikipedia
articles as well as the history of contributions by the already onboarded
Wikipedia editors. We assess the quality of our questionnaire-based
recommendations in an offline evaluation using historical data, as well as an
online evaluation with hundreds of real Wikipedia newcomers, concluding that
our method provides cohesive, human-readable questions that perform well
against several baselines. By addressing the cold-start problem, this work can
help with the sustainable growth and maintenance of Wikipedia's diverse editor
community.
Comment: Accepted at the 13th International AAAI Conference on Web and Social
Media (ICWSM-2019)
Citations with identifiers in Wikipedia
<p>This dataset includes a list of citations with identifiers extracted from the most recent version of Wikipedia across all language editions. The data was parsed from the Wikipedia content dumps published on March 1, 2018.</p>
<p><strong>License</strong><br></p>
<p>All files included in this dataset are released under CC0: https://creativecommons.org/publicdomain/zero/1.0/</p>
<p><strong>Projects</strong><br></p>
<p>Previous versions of this dataset ("Scholarly citations in Wikipedia") were limited to the English language edition. The current version includes one dataset for each of the 298 language editions that Wikipedia supports as of March 2018. Projects are identified by their ISO 639-1/639-2 language code, per https://meta.wikimedia.org/wiki/List_of_Wikipedias.</p>
<p><strong>Identifiers</strong><br></p>
<p>• PubMed IDs (pmid) and PubMed Central IDs (pmcid)<br>• Digital Object Identifiers (doi)<br>• International Standard Book Numbers (isbn)<br>• arXiv IDs (arxiv)</p>
<p><strong>Format</strong><br></p>
<p>Each row in the dataset represents a citation as a (Wikipedia article, cited source) pair. Metadata about when the citation was first added is included.</p>
<p>• page_id -- The identifier of the Wikipedia article (int), e.g. <em>1325125</em><br>• page_title -- The title of the Wikipedia article (utf-8), e.g. <em>Club cell</em><br>• rev_id -- The Wikipedia revision where the citation was first added (int), e.g. <em>282470030</em><br>• timestamp -- The timestamp of the revision where the citation was first added (ISO 8601 datetime), e.g. <em>2009-04-08T01:52:20Z</em><br>• type -- The type of identifier, e.g. <em>pmid</em><br>• id -- The id of the cited source (utf-8), e.g. <em>18179694</em></p>
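<p>The row layout above can be read into typed records along the following lines. This is a minimal sketch, not the dataset's official loader: it assumes the files are tab-separated text with a header row naming the six fields, which the description does not state. The names <em>Citation</em>, <em>parse_citations</em> and <em>SAMPLE</em> are hypothetical, and the sample row is assembled from the example values in the field list.</p>

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Citation:
    """One (Wikipedia article, cited source) pair, as described in the field list."""
    page_id: int
    page_title: str
    rev_id: int
    timestamp: datetime  # revision where the citation was first added (UTC)
    type: str            # identifier type, e.g. "pmid", "doi", "isbn"
    id: str              # identifier value, kept as text exactly as extracted


def parse_citations(lines, delimiter="\t"):
    """Yield Citation records from delimited text lines.

    Assumption: the input has a header row with the six field names and
    uses tabs as the delimiter; adjust `delimiter` if the files differ.
    """
    reader = csv.DictReader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        yield Citation(
            page_id=int(row["page_id"]),
            page_title=row["page_title"],
            rev_id=int(row["rev_id"]),
            # Timestamps end in "Z", so they are interpreted as UTC.
            timestamp=datetime.strptime(
                row["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc),
            type=row["type"],
            id=row["id"],
        )


# Hypothetical sample built from the example values in the field list.
SAMPLE = (
    "page_id\tpage_title\trev_id\ttimestamp\ttype\tid\n"
    "1325125\tClub cell\t282470030\t2009-04-08T01:52:20Z\tpmid\t18179694\n"
)

citations = list(parse_citations(io.StringIO(SAMPLE)))
print(citations[0])
```

<p>The id field is kept as a string because identifiers such as DOIs and ISBNs are not numeric, and because the identifiers are stored as they appear in the article text.</p>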
<p><strong>Source code</strong><br></p>
<p>https://github.com/halfak/Extract-scholarly-article-citations-from-Wikipedia (MIT Licensed)</p>
<p>A copy of this dataset is also available at https://analytics.wikimedia.org/datasets/archive/public-datasets/all/mwrefs/</p><p><strong>Notes</strong><br></p>
<p>Citation identifiers are extracted as-is from Wikipedia article content. Our spot-checking suggests that 98% of identifiers resolve.</p>